Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. [http://machinelearningmastery.com/]
SUMMARY: [Sample Paragraph - The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Connectionist Bench dataset presents a binary classification task in which we try to predict one of two possible outcomes.]
INTRODUCTION: [Sample Paragraph - The data file contains patterns obtained by bouncing sonar signals off a metal cylinder or a rock at various angles and under various conditions. The transmitted sonar signal is a frequency-modulated chirp, rising in frequency. The data set contains signals obtained from a variety of different aspect angles, spanning 90 degrees for the cylinder and 180 degrees for the rock. Each pattern is a set of 60 numbers in the range 0.0 to 1.0. Each number represents the energy within a particular frequency band, integrated over a certain period of time.]
ANALYSIS: [Sample Paragraph - The baseline performance of the machine learning algorithms achieved an average accuracy of 78.39%. Two algorithms (Random Forest and Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result and achieved an accuracy metric of 85.44%. After applying the optimized parameters, the Random Forest algorithm processed the testing dataset with an accuracy of 80.39%, which was below the prediction from the training data.]
CONCLUSION: [Sample Paragraph - For this iteration, the Random Forest algorithm achieved the best overall training and validation results. For this dataset, the Random Forest algorithm could be considered for further modeling.]
Dataset Used: [Connectionist Bench (Sonar, Mines vs. Rocks) Data Set]
Dataset ML Model: Binary classification with numerical attributes
Dataset Reference: [https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+%28Sonar%2C+Mines+vs.+Rocks%29]
One potential source of performance benchmarks: [https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+%28Sonar%2C+Mines+vs.+Rocks%29]
The project aims to touch on the following areas:
Any predictive modeling machine learning project can generally be broken down into about six major tasks:
startTimeScript <- proc.time()
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
library(corrplot)
## corrplot 0.84 loaded
library(DMwR)
## Loading required package: grid
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(Hmisc)
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(mailR)
## Registered S3 method overwritten by 'R.oo':
## method from
## throw.default R.methodsS3
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(stringr)
# Create the random seed number for reproducible results
seedNum <- 888
# Set up the notifyStatus flag to control progress emails (setting it to TRUE will send status emails!)
notifyStatus <- FALSE
# Run algorithms using 10-fold cross validation
control <- trainControl(method="repeatedcv", number=10, repeats=1)
metricTarget <- "Accuracy"
# Set up the email notification function
email_notify <- function(msg=""){
sender <- Sys.getenv("MAIL_SENDER")
receiver <- Sys.getenv("MAIL_RECEIVER")
gateway <- Sys.getenv("SMTP_GATEWAY")
smtpuser <- Sys.getenv("SMTP_USERNAME")
password <- Sys.getenv("SMTP_PASSWORD")
sbj_line <- "Notification from R Binary Classification Script"
send.mail(
from = sender,
to = receiver,
subject= sbj_line,
body = msg,
smtp = list(host.name = gateway, port = 587, user.name = smtpuser, passwd = password, ssl = TRUE),
authenticate = TRUE,
send = TRUE)
}
if (notifyStatus) email_notify(paste("Library and Data Loading has begun!",date()))
# Slicing up the document path to get the final destination file name
dataset_path <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data'
doc_path_list <- str_split(dataset_path, "/")
dest_file <- doc_path_list[[1]][length(doc_path_list[[1]])]
if (!file.exists(dest_file)) {
# Download the document from the website
cat("Downloading", dataset_path, "as", dest_file, "\n")
download.file(dataset_path, dest_file, mode = "wb")
cat(dest_file, "downloaded!\n")
# unzip(dest_file)
# cat(dest_file, "unpacked!\n")
}
inputFile <- dest_file
colNames <- paste0("attr",1:60)
colNames <- c(colNames, 'targetVar')
Xy_original <- read.csv(inputFile, sep=',', header=FALSE, col.names = colNames)
# Different ways of reading and processing the input dataset. Saved here for future reference.
#X_train <- read.fwf("X_train.txt", widths = widthVector, col.names = colNames)
#y_train <- read.csv("y_train.txt", header = FALSE, col.names = c("targetVar"))
#y_train$targetVar <- as.factor(y_train$targetVar)
#Xy_train <- cbind(X_train, y_train)
# Take a peek at the dataframe after the import
head(Xy_original)
## attr1 attr2 attr3 attr4 attr5 attr6 attr7 attr8 attr9 attr10
## 1 0.0200 0.0371 0.0428 0.0207 0.0954 0.0986 0.1539 0.1601 0.3109 0.2111
## 2 0.0453 0.0523 0.0843 0.0689 0.1183 0.2583 0.2156 0.3481 0.3337 0.2872
## 3 0.0262 0.0582 0.1099 0.1083 0.0974 0.2280 0.2431 0.3771 0.5598 0.6194
## 4 0.0100 0.0171 0.0623 0.0205 0.0205 0.0368 0.1098 0.1276 0.0598 0.1264
## 5 0.0762 0.0666 0.0481 0.0394 0.0590 0.0649 0.1209 0.2467 0.3564 0.4459
## 6 0.0286 0.0453 0.0277 0.0174 0.0384 0.0990 0.1201 0.1833 0.2105 0.3039
## attr11 attr12 attr13 attr14 attr15 attr16 attr17 attr18 attr19 attr20
## 1 0.1609 0.1582 0.2238 0.0645 0.0660 0.2273 0.3100 0.2999 0.5078 0.4797
## 2 0.4918 0.6552 0.6919 0.7797 0.7464 0.9444 1.0000 0.8874 0.8024 0.7818
## 3 0.6333 0.7060 0.5544 0.5320 0.6479 0.6931 0.6759 0.7551 0.8929 0.8619
## 4 0.0881 0.1992 0.0184 0.2261 0.1729 0.2131 0.0693 0.2281 0.4060 0.3973
## 5 0.4152 0.3952 0.4256 0.4135 0.4528 0.5326 0.7306 0.6193 0.2032 0.4636
## 6 0.2988 0.4250 0.6343 0.8198 1.0000 0.9988 0.9508 0.9025 0.7234 0.5122
## attr21 attr22 attr23 attr24 attr25 attr26 attr27 attr28 attr29 attr30
## 1 0.5783 0.5071 0.4328 0.5550 0.6711 0.6415 0.7104 0.8080 0.6791 0.3857
## 2 0.5212 0.4052 0.3957 0.3914 0.3250 0.3200 0.3271 0.2767 0.4423 0.2028
## 3 0.7974 0.6737 0.4293 0.3648 0.5331 0.2413 0.5070 0.8533 0.6036 0.8514
## 4 0.2741 0.3690 0.5556 0.4846 0.3140 0.5334 0.5256 0.2520 0.2090 0.3559
## 5 0.4148 0.4292 0.5730 0.5399 0.3161 0.2285 0.6995 1.0000 0.7262 0.4724
## 6 0.2074 0.3985 0.5890 0.2872 0.2043 0.5782 0.5389 0.3750 0.3411 0.5067
## attr31 attr32 attr33 attr34 attr35 attr36 attr37 attr38 attr39 attr40
## 1 0.1307 0.2604 0.5121 0.7547 0.8537 0.8507 0.6692 0.6097 0.4943 0.2744
## 2 0.3788 0.2947 0.1984 0.2341 0.1306 0.4182 0.3835 0.1057 0.1840 0.1970
## 3 0.8512 0.5045 0.1862 0.2709 0.4232 0.3043 0.6116 0.6756 0.5375 0.4719
## 4 0.6260 0.7340 0.6120 0.3497 0.3953 0.3012 0.5408 0.8814 0.9857 0.9167
## 5 0.5103 0.5459 0.2881 0.0981 0.1951 0.4181 0.4604 0.3217 0.2828 0.2430
## 6 0.5580 0.4778 0.3299 0.2198 0.1407 0.2856 0.3807 0.4158 0.4054 0.3296
## attr41 attr42 attr43 attr44 attr45 attr46 attr47 attr48 attr49 attr50
## 1 0.0510 0.2834 0.2825 0.4256 0.2641 0.1386 0.1051 0.1343 0.0383 0.0324
## 2 0.1674 0.0583 0.1401 0.1628 0.0621 0.0203 0.0530 0.0742 0.0409 0.0061
## 3 0.4647 0.2587 0.2129 0.2222 0.2111 0.0176 0.1348 0.0744 0.0130 0.0106
## 4 0.6121 0.5006 0.3210 0.3202 0.4295 0.3654 0.2655 0.1576 0.0681 0.0294
## 5 0.1979 0.2444 0.1847 0.0841 0.0692 0.0528 0.0357 0.0085 0.0230 0.0046
## 6 0.2707 0.2650 0.0723 0.1238 0.1192 0.1089 0.0623 0.0494 0.0264 0.0081
## attr51 attr52 attr53 attr54 attr55 attr56 attr57 attr58 attr59 attr60
## 1 0.0232 0.0027 0.0065 0.0159 0.0072 0.0167 0.0180 0.0084 0.0090 0.0032
## 2 0.0125 0.0084 0.0089 0.0048 0.0094 0.0191 0.0140 0.0049 0.0052 0.0044
## 3 0.0033 0.0232 0.0166 0.0095 0.0180 0.0244 0.0316 0.0164 0.0095 0.0078
## 4 0.0241 0.0121 0.0036 0.0150 0.0085 0.0073 0.0050 0.0044 0.0040 0.0117
## 5 0.0156 0.0031 0.0054 0.0105 0.0110 0.0015 0.0072 0.0048 0.0107 0.0094
## 6 0.0104 0.0045 0.0014 0.0038 0.0013 0.0089 0.0057 0.0027 0.0051 0.0062
## targetVar
## 1 R
## 2 R
## 3 R
## 4 R
## 5 R
## 6 R
sapply(Xy_original, class)
## attr1 attr2 attr3 attr4 attr5 attr6 attr7
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr8 attr9 attr10 attr11 attr12 attr13 attr14
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr15 attr16 attr17 attr18 attr19 attr20 attr21
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr22 attr23 attr24 attr25 attr26 attr27 attr28
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr29 attr30 attr31 attr32 attr33 attr34 attr35
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr36 attr37 attr38 attr39 attr40 attr41 attr42
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr43 attr44 attr45 attr46 attr47 attr48 attr49
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr50 attr51 attr52 attr53 attr54 attr55 attr56
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr57 attr58 attr59 attr60 targetVar
## "numeric" "numeric" "numeric" "numeric" "factor"
sapply(Xy_original, function(x) sum(is.na(x)))
## attr1 attr2 attr3 attr4 attr5 attr6 attr7
## 0 0 0 0 0 0 0
## attr8 attr9 attr10 attr11 attr12 attr13 attr14
## 0 0 0 0 0 0 0
## attr15 attr16 attr17 attr18 attr19 attr20 attr21
## 0 0 0 0 0 0 0
## attr22 attr23 attr24 attr25 attr26 attr27 attr28
## 0 0 0 0 0 0 0
## attr29 attr30 attr31 attr32 attr33 attr34 attr35
## 0 0 0 0 0 0 0
## attr36 attr37 attr38 attr39 attr40 attr41 attr42
## 0 0 0 0 0 0 0
## attr43 attr44 attr45 attr46 attr47 attr48 attr49
## 0 0 0 0 0 0 0
## attr50 attr51 attr52 attr53 attr54 attr55 attr56
## 0 0 0 0 0 0 0
## attr57 attr58 attr59 attr60 targetVar
## 0 0 0 0 0
# Not applicable for this iteration of the project
# Sample code for performing basic data cleaning tasks
# Dropping features
# Xy_original$column_name <- NULL
# Mark missing values
# invalid <- 0
# Xy_original$column_name[Xy_original$column_name==invalid] <- NA
# Impute missing values
# column_median <- median(Xy_original$column_name, na.rm = TRUE)
# Xy_original$column_name[Xy_original$column_name==0] <- column_median
# Xy_original$column_name <- with(Xy_original, impute(column_name, column_median))
# Convert columns from one data type to another
# Xy_original$column_name <- as.integer(Xy_original$column_name)
# Xy_original$column_name <- as.factor(Xy_original$column_name)
# Take a peek at the dataframe after the cleaning
head(Xy_original)
## attr1 attr2 attr3 attr4 attr5 attr6 attr7 attr8 attr9 attr10
## 1 0.0200 0.0371 0.0428 0.0207 0.0954 0.0986 0.1539 0.1601 0.3109 0.2111
## 2 0.0453 0.0523 0.0843 0.0689 0.1183 0.2583 0.2156 0.3481 0.3337 0.2872
## 3 0.0262 0.0582 0.1099 0.1083 0.0974 0.2280 0.2431 0.3771 0.5598 0.6194
## 4 0.0100 0.0171 0.0623 0.0205 0.0205 0.0368 0.1098 0.1276 0.0598 0.1264
## 5 0.0762 0.0666 0.0481 0.0394 0.0590 0.0649 0.1209 0.2467 0.3564 0.4459
## 6 0.0286 0.0453 0.0277 0.0174 0.0384 0.0990 0.1201 0.1833 0.2105 0.3039
## attr11 attr12 attr13 attr14 attr15 attr16 attr17 attr18 attr19 attr20
## 1 0.1609 0.1582 0.2238 0.0645 0.0660 0.2273 0.3100 0.2999 0.5078 0.4797
## 2 0.4918 0.6552 0.6919 0.7797 0.7464 0.9444 1.0000 0.8874 0.8024 0.7818
## 3 0.6333 0.7060 0.5544 0.5320 0.6479 0.6931 0.6759 0.7551 0.8929 0.8619
## 4 0.0881 0.1992 0.0184 0.2261 0.1729 0.2131 0.0693 0.2281 0.4060 0.3973
## 5 0.4152 0.3952 0.4256 0.4135 0.4528 0.5326 0.7306 0.6193 0.2032 0.4636
## 6 0.2988 0.4250 0.6343 0.8198 1.0000 0.9988 0.9508 0.9025 0.7234 0.5122
## attr21 attr22 attr23 attr24 attr25 attr26 attr27 attr28 attr29 attr30
## 1 0.5783 0.5071 0.4328 0.5550 0.6711 0.6415 0.7104 0.8080 0.6791 0.3857
## 2 0.5212 0.4052 0.3957 0.3914 0.3250 0.3200 0.3271 0.2767 0.4423 0.2028
## 3 0.7974 0.6737 0.4293 0.3648 0.5331 0.2413 0.5070 0.8533 0.6036 0.8514
## 4 0.2741 0.3690 0.5556 0.4846 0.3140 0.5334 0.5256 0.2520 0.2090 0.3559
## 5 0.4148 0.4292 0.5730 0.5399 0.3161 0.2285 0.6995 1.0000 0.7262 0.4724
## 6 0.2074 0.3985 0.5890 0.2872 0.2043 0.5782 0.5389 0.3750 0.3411 0.5067
## attr31 attr32 attr33 attr34 attr35 attr36 attr37 attr38 attr39 attr40
## 1 0.1307 0.2604 0.5121 0.7547 0.8537 0.8507 0.6692 0.6097 0.4943 0.2744
## 2 0.3788 0.2947 0.1984 0.2341 0.1306 0.4182 0.3835 0.1057 0.1840 0.1970
## 3 0.8512 0.5045 0.1862 0.2709 0.4232 0.3043 0.6116 0.6756 0.5375 0.4719
## 4 0.6260 0.7340 0.6120 0.3497 0.3953 0.3012 0.5408 0.8814 0.9857 0.9167
## 5 0.5103 0.5459 0.2881 0.0981 0.1951 0.4181 0.4604 0.3217 0.2828 0.2430
## 6 0.5580 0.4778 0.3299 0.2198 0.1407 0.2856 0.3807 0.4158 0.4054 0.3296
## attr41 attr42 attr43 attr44 attr45 attr46 attr47 attr48 attr49 attr50
## 1 0.0510 0.2834 0.2825 0.4256 0.2641 0.1386 0.1051 0.1343 0.0383 0.0324
## 2 0.1674 0.0583 0.1401 0.1628 0.0621 0.0203 0.0530 0.0742 0.0409 0.0061
## 3 0.4647 0.2587 0.2129 0.2222 0.2111 0.0176 0.1348 0.0744 0.0130 0.0106
## 4 0.6121 0.5006 0.3210 0.3202 0.4295 0.3654 0.2655 0.1576 0.0681 0.0294
## 5 0.1979 0.2444 0.1847 0.0841 0.0692 0.0528 0.0357 0.0085 0.0230 0.0046
## 6 0.2707 0.2650 0.0723 0.1238 0.1192 0.1089 0.0623 0.0494 0.0264 0.0081
## attr51 attr52 attr53 attr54 attr55 attr56 attr57 attr58 attr59 attr60
## 1 0.0232 0.0027 0.0065 0.0159 0.0072 0.0167 0.0180 0.0084 0.0090 0.0032
## 2 0.0125 0.0084 0.0089 0.0048 0.0094 0.0191 0.0140 0.0049 0.0052 0.0044
## 3 0.0033 0.0232 0.0166 0.0095 0.0180 0.0244 0.0316 0.0164 0.0095 0.0078
## 4 0.0241 0.0121 0.0036 0.0150 0.0085 0.0073 0.0050 0.0044 0.0040 0.0117
## 5 0.0156 0.0031 0.0054 0.0105 0.0110 0.0015 0.0072 0.0048 0.0107 0.0094
## 6 0.0104 0.0045 0.0014 0.0038 0.0013 0.0089 0.0057 0.0027 0.0051 0.0062
## targetVar
## 1 R
## 2 R
## 3 R
## 4 R
## 5 R
## 6 R
sapply(Xy_original, class)
## attr1 attr2 attr3 attr4 attr5 attr6 attr7
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr8 attr9 attr10 attr11 attr12 attr13 attr14
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr15 attr16 attr17 attr18 attr19 attr20 attr21
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr22 attr23 attr24 attr25 attr26 attr27 attr28
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr29 attr30 attr31 attr32 attr33 attr34 attr35
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr36 attr37 attr38 attr39 attr40 attr41 attr42
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr43 attr44 attr45 attr46 attr47 attr48 attr49
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr50 attr51 attr52 attr53 attr54 attr55 attr56
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr57 attr58 attr59 attr60 targetVar
## "numeric" "numeric" "numeric" "numeric" "factor"
sapply(Xy_original, function(x) sum(is.na(x)))
## attr1 attr2 attr3 attr4 attr5 attr6 attr7
## 0 0 0 0 0 0 0
## attr8 attr9 attr10 attr11 attr12 attr13 attr14
## 0 0 0 0 0 0 0
## attr15 attr16 attr17 attr18 attr19 attr20 attr21
## 0 0 0 0 0 0 0
## attr22 attr23 attr24 attr25 attr26 attr27 attr28
## 0 0 0 0 0 0 0
## attr29 attr30 attr31 attr32 attr33 attr34 attr35
## 0 0 0 0 0 0 0
## attr36 attr37 attr38 attr39 attr40 attr41 attr42
## 0 0 0 0 0 0 0
## attr43 attr44 attr45 attr46 attr47 attr48 attr49
## 0 0 0 0 0 0 0
## attr50 attr51 attr52 attr53 attr54 attr55 attr56
## 0 0 0 0 0 0 0
## attr57 attr58 attr59 attr60 targetVar
## 0 0 0 0 0
# Use variable totCol to hold the number of columns in the dataframe
totCol <- ncol(Xy_original)
# Set up variable totAttr for the total number of attribute columns
totAttr <- totCol-1
# targetCol variable indicates the column location of the target/class variable
# If the first column, set targetCol to 1. If the last column, set targetCol to totCol
# If targetCol != 1 and targetCol != totCol, be aware when slicing up the dataframes for visualization!
targetCol <- totCol
# Standardize the class column to the name of targetVar if applicable
#colnames(Xy_original)[targetCol] <- "targetVar"
#Xy_original$targetVar <- relevel(Xy_original$targetVar,"pos")
# Create various sub-datasets for visualization and cleaning/transformation operations.
set.seed(seedNum)
# Use 75% of the data to train the models and the remaining for testing/validation
training_index <- createDataPartition(Xy_original$targetVar, p=0.75, list=FALSE)
Xy_train <- Xy_original[training_index,]
Xy_test <- Xy_original[-training_index,]
if (targetCol==1) {
X_train <- Xy_train[,(targetCol+1):totCol]
y_train <- Xy_train[,targetCol]
y_test <- Xy_test[,targetCol]
} else {
X_train <- Xy_train[,1:(totAttr)]
y_train <- Xy_train[,totCol]
y_test <- Xy_test[,totCol]
}
# Set up the number of row and columns for visualization display. dispRow * dispCol should be >= totAttr
dispCol <- 4
if (totAttr%%dispCol == 0) {
dispRow <- totAttr%/%dispCol
} else {
dispRow <- (totAttr%/%dispCol) + 1
}
cat("Will attempt to create graphics grid (col x row): ", dispCol, ' by ', dispRow)
## Will attempt to create graphics grid (col x row): 4 by 15
if (notifyStatus) email_notify(paste("Library and Data Loading completed!",date()))
To gain a better understanding of the data that we have on hand, we will leverage a number of descriptive statistics and data visualization techniques. The plan is to use the results to consider new questions, review assumptions, and identify hypotheses that we can investigate later with specialized models.
if (notifyStatus) email_notify(paste("Data Summarization and Visualization has begun!",date()))
head(Xy_train)
## attr1 attr2 attr3 attr4 attr5 attr6 attr7 attr8 attr9 attr10
## 1 0.0200 0.0371 0.0428 0.0207 0.0954 0.0986 0.1539 0.1601 0.3109 0.2111
## 2 0.0453 0.0523 0.0843 0.0689 0.1183 0.2583 0.2156 0.3481 0.3337 0.2872
## 4 0.0100 0.0171 0.0623 0.0205 0.0205 0.0368 0.1098 0.1276 0.0598 0.1264
## 6 0.0286 0.0453 0.0277 0.0174 0.0384 0.0990 0.1201 0.1833 0.2105 0.3039
## 7 0.0317 0.0956 0.1321 0.1408 0.1674 0.1710 0.0731 0.1401 0.2083 0.3513
## 8 0.0519 0.0548 0.0842 0.0319 0.1158 0.0922 0.1027 0.0613 0.1465 0.2838
## attr11 attr12 attr13 attr14 attr15 attr16 attr17 attr18 attr19 attr20
## 1 0.1609 0.1582 0.2238 0.0645 0.0660 0.2273 0.3100 0.2999 0.5078 0.4797
## 2 0.4918 0.6552 0.6919 0.7797 0.7464 0.9444 1.0000 0.8874 0.8024 0.7818
## 4 0.0881 0.1992 0.0184 0.2261 0.1729 0.2131 0.0693 0.2281 0.4060 0.3973
## 6 0.2988 0.4250 0.6343 0.8198 1.0000 0.9988 0.9508 0.9025 0.7234 0.5122
## 7 0.1786 0.0658 0.0513 0.3752 0.5419 0.5440 0.5150 0.4262 0.2024 0.4233
## 8 0.2802 0.3086 0.2657 0.3801 0.5626 0.4376 0.2617 0.1199 0.6676 0.9402
## attr21 attr22 attr23 attr24 attr25 attr26 attr27 attr28 attr29 attr30
## 1 0.5783 0.5071 0.4328 0.5550 0.6711 0.6415 0.7104 0.8080 0.6791 0.3857
## 2 0.5212 0.4052 0.3957 0.3914 0.3250 0.3200 0.3271 0.2767 0.4423 0.2028
## 4 0.2741 0.3690 0.5556 0.4846 0.3140 0.5334 0.5256 0.2520 0.2090 0.3559
## 6 0.2074 0.3985 0.5890 0.2872 0.2043 0.5782 0.5389 0.3750 0.3411 0.5067
## 7 0.7723 0.9735 0.9390 0.5559 0.5268 0.6826 0.5713 0.5429 0.2177 0.2149
## 8 0.7832 0.5352 0.6809 0.9174 0.7613 0.8220 0.8872 0.6091 0.2967 0.1103
## attr31 attr32 attr33 attr34 attr35 attr36 attr37 attr38 attr39 attr40
## 1 0.1307 0.2604 0.5121 0.7547 0.8537 0.8507 0.6692 0.6097 0.4943 0.2744
## 2 0.3788 0.2947 0.1984 0.2341 0.1306 0.4182 0.3835 0.1057 0.1840 0.1970
## 4 0.6260 0.7340 0.6120 0.3497 0.3953 0.3012 0.5408 0.8814 0.9857 0.9167
## 6 0.5580 0.4778 0.3299 0.2198 0.1407 0.2856 0.3807 0.4158 0.4054 0.3296
## 7 0.5811 0.6323 0.2965 0.1873 0.2969 0.5163 0.6153 0.4283 0.5479 0.6133
## 8 0.1318 0.0624 0.0990 0.4006 0.3666 0.1050 0.1915 0.3930 0.4288 0.2546
## attr41 attr42 attr43 attr44 attr45 attr46 attr47 attr48 attr49 attr50
## 1 0.0510 0.2834 0.2825 0.4256 0.2641 0.1386 0.1051 0.1343 0.0383 0.0324
## 2 0.1674 0.0583 0.1401 0.1628 0.0621 0.0203 0.0530 0.0742 0.0409 0.0061
## 4 0.6121 0.5006 0.3210 0.3202 0.4295 0.3654 0.2655 0.1576 0.0681 0.0294
## 6 0.2707 0.2650 0.0723 0.1238 0.1192 0.1089 0.0623 0.0494 0.0264 0.0081
## 7 0.5017 0.2377 0.1957 0.1749 0.1304 0.0597 0.1124 0.1047 0.0507 0.0159
## 8 0.1151 0.2196 0.1879 0.1437 0.2146 0.2360 0.1125 0.0254 0.0285 0.0178
## attr51 attr52 attr53 attr54 attr55 attr56 attr57 attr58 attr59 attr60
## 1 0.0232 0.0027 0.0065 0.0159 0.0072 0.0167 0.0180 0.0084 0.0090 0.0032
## 2 0.0125 0.0084 0.0089 0.0048 0.0094 0.0191 0.0140 0.0049 0.0052 0.0044
## 4 0.0241 0.0121 0.0036 0.0150 0.0085 0.0073 0.0050 0.0044 0.0040 0.0117
## 6 0.0104 0.0045 0.0014 0.0038 0.0013 0.0089 0.0057 0.0027 0.0051 0.0062
## 7 0.0195 0.0201 0.0248 0.0131 0.0070 0.0138 0.0092 0.0143 0.0036 0.0103
## 8 0.0052 0.0081 0.0120 0.0045 0.0121 0.0097 0.0085 0.0047 0.0048 0.0053
## targetVar
## 1 R
## 2 R
## 4 R
## 6 R
## 7 R
## 8 R
dim(Xy_train)
## [1] 157 61
sapply(Xy_train, class)
## attr1 attr2 attr3 attr4 attr5 attr6 attr7
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr8 attr9 attr10 attr11 attr12 attr13 attr14
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr15 attr16 attr17 attr18 attr19 attr20 attr21
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr22 attr23 attr24 attr25 attr26 attr27 attr28
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr29 attr30 attr31 attr32 attr33 attr34 attr35
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr36 attr37 attr38 attr39 attr40 attr41 attr42
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr43 attr44 attr45 attr46 attr47 attr48 attr49
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr50 attr51 attr52 attr53 attr54 attr55 attr56
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## attr57 attr58 attr59 attr60 targetVar
## "numeric" "numeric" "numeric" "numeric" "factor"
summary(Xy_train)
## attr1 attr2 attr3 attr4
## Min. :0.00150 Min. :0.00220 Min. :0.00300 Min. :0.00610
## 1st Qu.:0.01350 1st Qu.:0.01720 1st Qu.:0.01900 1st Qu.:0.02450
## Median :0.02310 Median :0.03090 Median :0.03460 Median :0.04320
## Mean :0.02978 Mean :0.03951 Mean :0.04569 Mean :0.05538
## 3rd Qu.:0.03650 3rd Qu.:0.04770 3rd Qu.:0.06040 3rd Qu.:0.06330
## Max. :0.13710 Max. :0.23390 Max. :0.30590 Max. :0.42640
## attr5 attr6 attr7 attr8
## Min. :0.00670 Min. :0.0102 Min. :0.0033 Min. :0.0055
## 1st Qu.:0.03700 1st Qu.:0.0679 1st Qu.:0.0843 1st Qu.:0.0802
## Median :0.06170 Median :0.0924 Median :0.1098 Median :0.1130
## Mean :0.07504 Mean :0.1057 Mean :0.1234 Mean :0.1343
## 3rd Qu.:0.09950 3rd Qu.:0.1354 3rd Qu.:0.1597 3rd Qu.:0.1694
## Max. :0.40100 Max. :0.3823 Max. :0.3729 Max. :0.4566
## attr9 attr10 attr11 attr12
## Min. :0.0298 Min. :0.0113 Min. :0.0327 Min. :0.0236
## 1st Qu.:0.0974 1st Qu.:0.1186 1st Qu.:0.1445 1st Qu.:0.1382
## Median :0.1552 Median :0.1895 Median :0.2309 Median :0.2484
## Mean :0.1796 Mean :0.2084 Mean :0.2360 Mean :0.2490
## 3rd Qu.:0.2361 3rd Qu.:0.2718 3rd Qu.:0.3003 3rd Qu.:0.3259
## Max. :0.6828 Max. :0.7106 Max. :0.7342 Max. :0.6552
## attr13 attr14 attr15 attr16
## Min. :0.0184 Min. :0.0273 Min. :0.0031 Min. :0.0162
## 1st Qu.:0.1770 1st Qu.:0.1806 1st Qu.:0.1721 1st Qu.:0.2036
## Median :0.2510 Median :0.2904 Median :0.2950 Median :0.3234
## Mean :0.2775 Mean :0.3055 Mean :0.3299 Mean :0.3853
## 3rd Qu.:0.3603 3rd Qu.:0.3940 3rd Qu.:0.4725 3rd Qu.:0.5392
## Max. :0.7131 Max. :0.9970 Max. :1.0000 Max. :0.9988
## attr17 attr18 attr19 attr20
## Min. :0.0349 Min. :0.0689 Min. :0.0494 Min. :0.0740
## 1st Qu.:0.2088 1st Qu.:0.2349 1st Qu.:0.2989 1st Qu.:0.3658
## Median :0.3232 Median :0.3655 Median :0.4327 Median :0.5167
## Mean :0.4221 Mean :0.4556 Mean :0.5064 Mean :0.5616
## 3rd Qu.:0.6687 3rd Qu.:0.6888 3rd Qu.:0.7309 3rd Qu.:0.8092
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## attr21 attr22 attr23 attr24
## Min. :0.0512 Min. :0.0689 Min. :0.0563 Min. :0.0239
## 1st Qu.:0.3906 1st Qu.:0.4075 1st Qu.:0.4611 1st Qu.:0.5470
## Median :0.6079 Median :0.6708 Median :0.7022 Median :0.7114
## Mean :0.6094 Mean :0.6325 Mean :0.6598 Mean :0.6861
## 3rd Qu.:0.8240 3rd Qu.:0.8515 3rd Qu.:0.8626 3rd Qu.:0.8675
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## attr25 attr26 attr27 attr28
## Min. :0.0885 Min. :0.0921 Min. :0.0481 Min. :0.0284
## 1st Qu.:0.5734 1st Qu.:0.5599 1st Qu.:0.5389 1st Qu.:0.5116
## Median :0.7152 Median :0.7529 Median :0.7567 Median :0.7353
## Mean :0.6830 Mean :0.7074 Mean :0.7064 Mean :0.6831
## 3rd Qu.:0.8675 3rd Qu.:0.8938 3rd Qu.:0.9180 3rd Qu.:0.8752
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## attr29 attr30 attr31 attr32
## Min. :0.0144 Min. :0.0613 Min. :0.0482 Min. :0.0404
## 1st Qu.:0.4488 1st Qu.:0.3917 1st Qu.:0.3139 1st Qu.:0.2822
## Median :0.6790 Median :0.5986 Median :0.4770 Median :0.4219
## Mean :0.6323 Mean :0.5743 Mean :0.4968 Mean :0.4354
## 3rd Qu.:0.8477 3rd Qu.:0.7575 3rd Qu.:0.6407 3rd Qu.:0.5749
## Max. :1.0000 Max. :1.0000 Max. :0.9657 Max. :0.9306
## attr33 attr34 attr35 attr36
## Min. :0.0477 Min. :0.0212 Min. :0.0223 Min. :0.0271
## 1st Qu.:0.2584 1st Qu.:0.2175 1st Qu.:0.1757 1st Qu.:0.1547
## Median :0.3903 Median :0.3409 Median :0.3108 Median :0.3195
## Mean :0.4122 Mean :0.3938 Mean :0.3859 Mean :0.3821
## 3rd Qu.:0.5409 3rd Qu.:0.5962 3rd Qu.:0.5902 3rd Qu.:0.5564
## Max. :0.9708 Max. :0.9647 Max. :1.0000 Max. :1.0000
## attr37 attr38 attr39 attr40
## Min. :0.0351 Min. :0.0383 Min. :0.0371 Min. :0.0117
## 1st Qu.:0.1644 1st Qu.:0.1736 1st Qu.:0.1694 1st Qu.:0.1848
## Median :0.3201 Median :0.3101 Median :0.2829 Median :0.2729
## Mean :0.3602 Mean :0.3337 Mean :0.3170 Mean :0.3020
## 3rd Qu.:0.5144 3rd Qu.:0.4374 3rd Qu.:0.4145 3rd Qu.:0.4158
## Max. :0.9497 Max. :1.0000 Max. :0.9857 Max. :0.9167
## attr41 attr42 attr43 attr44
## Min. :0.0360 Min. :0.0056 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.1581 1st Qu.:0.1466 1st Qu.:0.1552 1st Qu.:0.1262
## Median :0.2558 Median :0.2331 Median :0.2211 Median :0.1749
## Mean :0.2780 Mean :0.2660 Mean :0.2410 Mean :0.2113
## 3rd Qu.:0.3717 3rd Qu.:0.3712 3rd Qu.:0.3141 3rd Qu.:0.2694
## Max. :0.7322 Max. :0.8246 Max. :0.7517 Max. :0.5772
## attr45 attr46 attr47 attr48
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0944 1st Qu.:0.0690 1st Qu.:0.0621 1st Qu.:0.04200
## Median :0.1467 Median :0.1234 Median :0.1043 Median :0.07450
## Mean :0.1962 Mean :0.1596 Mean :0.1190 Mean :0.08803
## 3rd Qu.:0.2341 3rd Qu.:0.2001 3rd Qu.:0.1492 3rd Qu.:0.11640
## Max. :0.7034 Max. :0.7292 Max. :0.5522 Max. :0.33390
## attr49 attr50 attr51 attr52
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.00080
## 1st Qu.:0.02420 1st Qu.:0.01200 1st Qu.:0.0086 1st Qu.:0.00780
## Median :0.04220 Median :0.01850 Median :0.0140 Median :0.01180
## Mean :0.05022 Mean :0.02051 Mean :0.0162 Mean :0.01372
## 3rd Qu.:0.06810 3rd Qu.:0.02650 3rd Qu.:0.0209 3rd Qu.:0.01670
## Max. :0.16080 Max. :0.06370 Max. :0.1004 Max. :0.07090
## attr53 attr54 attr55 attr56
## Min. :0.00050 Min. :0.00110 Min. :0.000600 Min. :0.000600
## 1st Qu.:0.00490 1st Qu.:0.00550 1st Qu.:0.003900 1st Qu.:0.004800
## Median :0.00930 Median :0.00930 Median :0.007400 Median :0.007300
## Mean :0.01036 Mean :0.01113 Mean :0.009145 Mean :0.008222
## 3rd Qu.:0.01430 3rd Qu.:0.01450 3rd Qu.:0.012100 3rd Qu.:0.011100
## Max. :0.03170 Max. :0.03520 Max. :0.037600 Max. :0.032600
## attr57 attr58 attr59 attr60
## Min. :0.000300 Min. :0.000300 Min. :0.0001 Min. :0.000600
## 1st Qu.:0.003800 1st Qu.:0.003700 1st Qu.:0.0040 1st Qu.:0.003300
## Median :0.006500 Median :0.006300 Median :0.0067 Median :0.005400
## Mean :0.007862 Mean :0.008136 Mean :0.0082 Mean :0.006731
## 3rd Qu.:0.010500 3rd Qu.:0.010700 3rd Qu.:0.0109 3rd Qu.:0.008700
## Max. :0.025800 Max. :0.037700 Max. :0.0364 Max. :0.043900
## targetVar
## M:84
## R:73
##
##
##
##
cbind(freq=table(y_train), percentage=prop.table(table(y_train))*100)
## freq percentage
## M 84 53.50318
## R 73 46.49682
# Boxplots for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
boxplot(X_train[,i], main=names(X_train)[i])
}
# Histograms for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
hist(X_train[,i], main=names(X_train)[i])
}
# Density plot for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
plot(density(X_train[,i]), main=names(X_train)[i])
}
# Correlation matrix
correlations <- cor(X_train)
corrplot(correlations, method="circle")
if (notifyStatus) email_notify(paste("Data Summarization and Visualization completed!",date()))
Some datasets may require additional preparation activities that best expose the structure of the problem and the relationships between the input attributes and the output variable. Such data-prep tasks might include:
if (notifyStatus) email_notify(paste("Data Cleaning and Transformation has begun!",date()))
# Not applicable for this iteration of the project
# Sample code for performing SMOTE transformation on the training data
# set.seed(seedNum)
# Xy_train <- SMOTE(targetVar ~., data=Xy_train, perc.over=200, perc.under=300)
# totCol <- ncol(Xy_train)
# y_train <- Xy_train[,totCol]
# cbind(freq=table(y_train), percentage=prop.table(table(y_train))*100)
# Not applicable for this iteration of the project
# Sample Code for finding collinear features (Block #1 of 2)
# Using the correlations calculated previously, we try to find attributes that are highly correlated.
# highlyCorrelated <- findCorrelation(correlations, cutoff=0.85)
# print(highlyCorrelated)
# cat('Number of attributes found to be highly correlated:',length(highlyCorrelated))
# Sample Code for finding collinear features (Block #2 of 2)
# Removing the highly correlated attributes from the training and validation dataframes
# Xy_train <- Xy_train[, -highlyCorrelated]
# Xy_test <- Xy_test[, -highlyCorrelated]
# Not applicable for this iteration of the project
# Sample code for performing Attribute Importance Ranking (Block #1 of 3)
# startTimeModule <- proc.time()
# set.seed(seedNum)
# library(gbm)
# model_fs <- train(targetVar~., data=Xy_train, method="gbm", preProcess="scale", trControl=control, verbose=F)
# rankedImportance <- varImp(model_fs, scale=FALSE)
# print(rankedImportance)
# plot(rankedImportance)
# Sample code for performing Attribute Importance Ranking (Block #2 of 3)
# Set the importance threshold and calculate the list of attributes that don't contribute to the importance threshold
# maxThreshold <- 0.99
# rankedAttributes <- rankedImportance$importance
# rankedAttributes <- rankedAttributes[order(-rankedAttributes$Overall),,drop=FALSE]
# totalWeight <- sum(rankedAttributes)
# i <- 1
# accumWeight <- 0
# exit_now <- FALSE
# while ((i <= totAttr) & !exit_now) {
# accumWeight = accumWeight + rankedAttributes[i,]
# if ((accumWeight/totalWeight) >= maxThreshold) {
# exit_now <- TRUE
# } else {
# i <- i + 1
# }
# }
# lowImportance <- rankedAttributes[(i+1):(totAttr),,drop=FALSE]
# lowAttributes <- rownames(lowImportance)
# cat('Number of attributes contributed to the importance threshold:',i,"\n")
# cat('Number of attributes found to be of low importance:',length(lowAttributes))
# Sample code for performing Attribute Importance Ranking (Block #3 of 3)
# Removing the unselected attributes from the training and validation dataframes
# Xy_train <- Xy_train[, !(names(Xy_train) %in% lowAttributes)]
# Xy_test <- Xy_test[, !(names(Xy_test) %in% lowAttributes)]
# Not applicable for this iteration of the project
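The while loop above walks the ranked importances until the cumulative share reaches maxThreshold; the same selection can be written more compactly with cumsum. A sketch with hypothetical importance weights (a1 through a6 and the 0.95 threshold are made-up values for illustration):

```r
# Hypothetical importance weights, already sorted descending as varImp returns
weights <- c(a1 = 40, a2 = 25, a3 = 15, a4 = 10, a5 = 6, a6 = 4)
maxThreshold <- 0.95
# First position at which the cumulative importance share reaches the threshold
cutPoint <- which(cumsum(weights) / sum(weights) >= maxThreshold)[1]
keepAttributes <- names(weights)[seq_len(cutPoint)]
dropAttributes <- setdiff(names(weights), keepAttributes)
cat("Kept:", keepAttributes, "| Dropped:", dropAttributes, "\n")
```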
# Sample code for performing Recursive Feature Elimination (Block #1 of 2)
# startTimeModule <- proc.time()
# set.seed(seedNum)
# rfeCTRL <- rfeControl(functions=rfFuncs, method="cv", number=10)
# rfeResults <- rfe(Xy_train[,1:totAttr], Xy_train[,totCol], sizes=c(30:55), rfeControl=rfeCTRL)
# print(rfeResults)
# rfeAttributes <- predictors(rfeResults)
# cat('Number of attributes identified from the RFE algorithm:',length(rfeAttributes))
# print(rfeAttributes)
# plot(rfeResults, type=c("g", "o"))
# Sample code for performing Recursive Feature Elimination (Block #2 of 2)
# Removing the unselected attributes from the training and validation dataframes
# rfeAttributes <- c(rfeAttributes,"targetVar")
# Xy_train <- Xy_train[, (names(Xy_train) %in% rfeAttributes)]
# Xy_test <- Xy_test[, (names(Xy_test) %in% rfeAttributes)]
# We finalize the training and testing datasets for the modeling activities
dim(Xy_train)
## [1] 157 61
dim(Xy_test)
## [1] 51 61
if (notifyStatus) email_notify(paste("Data Cleaning and Transformation completed!",date()))
proc.time()-startTimeScript
## user system elapsed
## 25.377 0.382 25.824
After the data prep, we next work on finding a workable model by evaluating a subset of machine learning algorithms that are good at exploiting the structure of the training data.
For this project, we will evaluate one linear, one non-linear, and three ensemble algorithms:
Linear Algorithm: Logistic Regression
Non-Linear Algorithm: Decision Trees (CART)
Ensemble Algorithms: Bagged CART, Random Forest, and Gradient Boosting
The random number seed is reset before each run so that every algorithm is evaluated on exactly the same data splits, which makes the results directly comparable.
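The effect of resetting the seed can be seen in a small base-R illustration: the same seed makes two successive fold assignments identical, so every algorithm is scored on the same partitions (seedNum = 888 is an assumed value here; the project defines its own earlier):

```r
seedNum <- 888  # assumed value for illustration; the project sets its own
set.seed(seedNum)
foldsA <- sample(rep(1:10, length.out = 157))  # 10-fold assignment, run A
set.seed(seedNum)
foldsB <- sample(rep(1:10, length.out = 157))  # 10-fold assignment, run B
identical(foldsA, foldsB)  # TRUE: both runs see exactly the same splits
```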
# Logistic Regression (Classification)
if (notifyStatus) email_notify(paste("Logistic Regression modeling has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
fit.glm <- train(targetVar~., data=Xy_train, method="glm", metric=metricTarget, trControl=control)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## (the two warnings above repeat for each of the remaining cross-validation resamples)
print(fit.glm)
## Generalized Linear Model
##
## 157 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 141, 142, 141, 142, 141, 142, ...
## Resampling results:
##
## Accuracy Kappa
## 0.6938235 0.3878228
proc.time()-startTimeModule
## user system elapsed
## 0.962 0.048 0.988
if (notifyStatus) email_notify(paste("Logistic Regression modeling completed!",date()))
# Decision Tree - CART (Regression/Classification)
if (notifyStatus) email_notify(paste("Decision Tree modeling has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
fit.cart <- train(targetVar~., data=Xy_train, method="rpart", metric=metricTarget, trControl=control)
print(fit.cart)
## CART
##
## 157 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 141, 142, 141, 142, 141, 142, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.04109589 0.7022059 0.4002768
## 0.07534247 0.7205392 0.4338938
## 0.52054795 0.5913725 0.1316708
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.07534247.
proc.time()-startTimeModule
## user system elapsed
## 1.000 0.158 0.982
if (notifyStatus) email_notify(paste("Decision Tree modeling completed!",date()))
In this section, we will explore the use and tuning of ensemble algorithms to see whether we can improve the results.
# Bagged CART (Regression/Classification)
if (notifyStatus) email_notify(paste("Bagged CART modeling has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
fit.bagcart <- train(targetVar~., data=Xy_train, method="treebag", metric=metricTarget, trControl=control)
print(fit.bagcart)
## Bagged CART
##
## 157 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 141, 142, 141, 142, 141, 142, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7839216 0.5589413
proc.time()-startTimeModule
## user system elapsed
## 4.403 0.320 4.354
if (notifyStatus) email_notify(paste("Bagged CART modeling completed!",date()))
# Random Forest (Regression/Classification)
if (notifyStatus) email_notify(paste("Random Forest modeling has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
fit.rf <- train(targetVar~., data=Xy_train, method="rf", metric=metricTarget, trControl=control)
print(fit.rf)
## Random Forest
##
## 157 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 141, 142, 141, 142, 141, 142, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8544363 0.7031428
## 31 0.7913725 0.5731117
## 60 0.7788725 0.5468892
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
proc.time()-startTimeModule
## user system elapsed
## 7.367 0.094 7.408
if (notifyStatus) email_notify(paste("Random Forest modeling completed!",date()))
# Gradient Boosting (Regression/Classification)
if (notifyStatus) email_notify(paste("Gradient Boosting modeling has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
fit.gbm <- train(targetVar~., data=Xy_train, method="xgbTree", metric=metricTarget, trControl=control, verbose=F)
# fit.gbm <- train(targetVar~., data=Xy_train, method="gbm", metric=metricTarget, trControl=control, verbose=F)
print(fit.gbm)
## eXtreme Gradient Boosting
##
## 157 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 141, 142, 141, 142, 141, 142, ...
## Resampling results across tuning parameters:
##
## eta max_depth colsample_bytree subsample nrounds Accuracy
## 0.3 1 0.6 0.50 50 0.8168873
## 0.3 1 0.6 0.50 100 0.8168873
## 0.3 1 0.6 0.50 150 0.8035539
## 0.3 1 0.6 0.75 50 0.8094363
## 0.3 1 0.6 0.75 100 0.8230882
## 0.3 1 0.6 0.75 150 0.8160539
## 0.3 1 0.6 1.00 50 0.7800735
## 0.3 1 0.6 1.00 100 0.8047549
## 0.3 1 0.6 1.00 150 0.8176716
## 0.3 1 0.8 0.50 50 0.7836029
## 0.3 1 0.8 0.50 100 0.8231373
## 0.3 1 0.8 0.50 150 0.8098039
## 0.3 1 0.8 0.75 50 0.8039706
## 0.3 1 0.8 0.75 100 0.8168382
## 0.3 1 0.8 0.75 150 0.8160539
## 0.3 1 0.8 1.00 50 0.8117892
## 0.3 1 0.8 1.00 100 0.8235049
## 0.3 1 0.8 1.00 150 0.8176716
## 0.3 2 0.6 0.50 50 0.8239216
## 0.3 2 0.6 0.50 100 0.8176716
## 0.3 2 0.6 0.50 150 0.8243382
## 0.3 2 0.6 0.75 50 0.7985049
## 0.3 2 0.6 0.75 100 0.8293382
## 0.3 2 0.6 0.75 150 0.8230392
## 0.3 2 0.6 1.00 50 0.8110049
## 0.3 2 0.6 1.00 100 0.8110049
## 0.3 2 0.6 1.00 150 0.8364216
## 0.3 2 0.8 0.50 50 0.8231373
## 0.3 2 0.8 0.50 100 0.8548529
## 0.3 2 0.8 0.50 150 0.8481863
## 0.3 2 0.8 0.75 50 0.8176225
## 0.3 2 0.8 0.75 100 0.8360049
## 0.3 2 0.8 0.75 150 0.8168382
## 0.3 2 0.8 1.00 50 0.8168382
## 0.3 2 0.8 1.00 100 0.8301716
## 0.3 2 0.8 1.00 150 0.8426716
## 0.3 3 0.6 0.50 50 0.8239216
## 0.3 3 0.6 0.50 100 0.8239216
## 0.3 3 0.6 0.50 150 0.8301716
## 0.3 3 0.6 0.75 50 0.8540196
## 0.3 3 0.6 0.75 100 0.8481863
## 0.3 3 0.6 0.75 150 0.8606863
## 0.3 3 0.6 1.00 50 0.8376225
## 0.3 3 0.6 1.00 100 0.8305392
## 0.3 3 0.6 1.00 150 0.8489216
## 0.3 3 0.8 0.50 50 0.7988725
## 0.3 3 0.8 0.50 100 0.8047059
## 0.3 3 0.8 0.50 150 0.7980392
## 0.3 3 0.8 0.75 50 0.8165196
## 0.3 3 0.8 0.75 100 0.8298039
## 0.3 3 0.8 0.75 150 0.8235539
## 0.3 3 0.8 1.00 50 0.8047549
## 0.3 3 0.8 1.00 100 0.8114216
## 0.3 3 0.8 1.00 150 0.8055392
## 0.4 1 0.6 0.50 50 0.8102206
## 0.4 1 0.6 0.50 100 0.8172059
## 0.4 1 0.6 0.50 150 0.8418873
## 0.4 1 0.6 0.75 50 0.7926225
## 0.4 1 0.6 0.75 100 0.7984559
## 0.4 1 0.6 0.75 150 0.8109559
## 0.4 1 0.6 1.00 50 0.7984559
## 0.4 1 0.6 1.00 100 0.8176716
## 0.4 1 0.6 1.00 150 0.8239216
## 0.4 1 0.8 0.50 50 0.8106373
## 0.4 1 0.8 0.50 100 0.7789216
## 0.4 1 0.8 0.50 150 0.7847549
## 0.4 1 0.8 0.75 50 0.7804902
## 0.4 1 0.8 0.75 100 0.8176716
## 0.4 1 0.8 0.75 150 0.8110049
## 0.4 1 0.8 1.00 50 0.7918382
## 0.4 1 0.8 1.00 100 0.8047549
## 0.4 1 0.8 1.00 150 0.8051225
## 0.4 2 0.6 0.50 50 0.8297549
## 0.4 2 0.6 0.50 100 0.8544363
## 0.4 2 0.6 0.50 150 0.8481373
## 0.4 2 0.6 0.75 50 0.8152206
## 0.4 2 0.6 0.75 100 0.8414706
## 0.4 2 0.6 0.75 150 0.8348039
## 0.4 2 0.6 1.00 50 0.8301716
## 0.4 2 0.6 1.00 100 0.8485539
## 0.4 2 0.6 1.00 150 0.8543873
## 0.4 2 0.8 0.50 50 0.8239216
## 0.4 2 0.8 0.50 100 0.8426716
## 0.4 2 0.8 0.50 150 0.8356373
## 0.4 2 0.8 0.75 50 0.8101716
## 0.4 2 0.8 0.75 100 0.8097549
## 0.4 2 0.8 0.75 150 0.8164216
## 0.4 2 0.8 1.00 50 0.8427206
## 0.4 2 0.8 1.00 100 0.8427206
## 0.4 2 0.8 1.00 150 0.8489706
## 0.4 3 0.6 0.50 50 0.8355882
## 0.4 3 0.6 0.50 100 0.8547549
## 0.4 3 0.6 0.50 150 0.8668873
## 0.4 3 0.6 0.75 50 0.8611029
## 0.4 3 0.6 0.75 100 0.8611029
## 0.4 3 0.6 0.75 150 0.8423529
## 0.4 3 0.6 1.00 50 0.8560049
## 0.4 3 0.6 1.00 100 0.8626716
## 0.4 3 0.6 1.00 150 0.8626716
## 0.4 3 0.8 0.50 50 0.8223529
## 0.4 3 0.8 0.50 100 0.8348529
## 0.4 3 0.8 0.50 150 0.8286029
## 0.4 3 0.8 0.75 50 0.8360539
## 0.4 3 0.8 0.75 100 0.8235049
## 0.4 3 0.8 0.75 150 0.8235049
## 0.4 3 0.8 1.00 50 0.8114216
## 0.4 3 0.8 1.00 100 0.8176716
## 0.4 3 0.8 1.00 150 0.8239216
## Kappa
## 0.6315823
## 0.6329939
## 0.6048946
## 0.6147303
## 0.6418262
## 0.6280605
## 0.5571303
## 0.6057445
## 0.6314499
## 0.5654489
## 0.6446709
## 0.6159590
## 0.6035579
## 0.6298326
## 0.6289612
## 0.6198738
## 0.6444349
## 0.6314499
## 0.6464896
## 0.6332833
## 0.6465555
## 0.5931558
## 0.6571509
## 0.6449065
## 0.6185050
## 0.6185639
## 0.6696853
## 0.6417337
## 0.7073266
## 0.6940270
## 0.6297837
## 0.6693100
## 0.6298126
## 0.6299186
## 0.6570062
## 0.6840248
## 0.6411224
## 0.6427948
## 0.6536403
## 0.7050803
## 0.6927612
## 0.7185875
## 0.6685572
## 0.6556991
## 0.6947216
## 0.5943765
## 0.6089801
## 0.5954543
## 0.6279501
## 0.6545052
## 0.6420052
## 0.6032361
## 0.6165039
## 0.6051111
## 0.6178218
## 0.6321596
## 0.6836622
## 0.5812686
## 0.5934809
## 0.6201130
## 0.5948287
## 0.6322755
## 0.6447755
## 0.6149270
## 0.5512020
## 0.5653241
## 0.5561618
## 0.6315188
## 0.6194669
## 0.5814255
## 0.6048712
## 0.6059784
## 0.6589752
## 0.7081875
## 0.6949773
## 0.6246251
## 0.6763298
## 0.6649650
## 0.6562322
## 0.6942634
## 0.7051074
## 0.6406182
## 0.6801589
## 0.6686234
## 0.6142405
## 0.6110694
## 0.6271997
## 0.6805868
## 0.6818163
## 0.6943163
## 0.6673660
## 0.7068641
## 0.7307625
## 0.7178303
## 0.7178303
## 0.6803367
## 0.7076643
## 0.7203720
## 0.7203720
## 0.6411488
## 0.6685886
## 0.6560886
## 0.6686832
## 0.6440169
## 0.6440169
## 0.6148372
## 0.6290449
## 0.6427942
##
## Tuning parameter 'gamma' was held constant at a value of 0
##
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 150, max_depth = 3,
## eta = 0.4, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1
## and subsample = 0.5.
proc.time()-startTimeModule
## user system elapsed
## 39.057 3.010 22.367
if (notifyStatus) email_notify(paste("Gradient Boosting modeling completed!",date()))
results <- resamples(list(LR=fit.glm, CART=fit.cart, BagCART=fit.bagcart, RF=fit.rf, GBM=fit.gbm))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: LR, CART, BagCART, RF, GBM
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LR 0.5882353 0.6354167 0.6666667 0.6938235 0.734375 0.9375 0
## CART 0.5625000 0.6519608 0.6770833 0.7205392 0.800000 1.0000 0
## BagCART 0.5625000 0.7127451 0.7666667 0.7839216 0.853125 1.0000 0
## RF 0.7500000 0.8031250 0.8180147 0.8544363 0.918750 1.0000 0
## GBM 0.6875000 0.7901961 0.8708333 0.8668873 0.937500 1.0000 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LR 0.15602837 0.2772349 0.3360752 0.3878228 0.4687500 0.8709677 0
## CART 0.06666667 0.2967317 0.3614130 0.4338938 0.6033107 1.0000000 0
## BagCART 0.09677419 0.4115087 0.5377271 0.5589413 0.7053571 1.0000000 0
## RF 0.47540984 0.6045532 0.6341783 0.7031428 0.8361486 1.0000000 0
## GBM 0.35483871 0.5776515 0.7324888 0.7307625 0.8750000 1.0000000 0
dotplot(results)
cat('The average accuracy from all models is:',
mean(c(results$values$`LR~Accuracy`,results$values$`CART~Accuracy`,results$values$`BagCART~Accuracy`,results$values$`RF~Accuracy`,results$values$`GBM~Accuracy`)))
## The average accuracy from all models is: 0.7839216
After we arrive at a short list of machine learning algorithms with a good level of accuracy, we can look for ways to improve their performance further.
Using the two best-performing algorithms from the previous section, we will search for a combination of parameters for each algorithm that yields the best results.
Finally, we will compare the tuned models and see whether the extra tuning gets more accuracy out of them.
# Tuning algorithm #1 - Random Forest
if (notifyStatus) email_notify(paste("Algorithm #1 tuning has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(mtry=c(2,15,30,45,60))
fit.final1 <- train(targetVar~., data=Xy_train, method="rf", metric=metricTarget, tuneGrid=grid, trControl=control)
print(fit.final1)
## Random Forest
##
## 157 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 141, 142, 141, 142, 141, 142, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8544363 0.7031428
## 31 0.7913725 0.5731117
## 60 0.7788725 0.5468892
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
proc.time()-startTimeModule
## user system elapsed
## 7.231 0.021 7.267
if (notifyStatus) email_notify(paste("Algorithm #1 tuning completed!",date()))
# Tuning algorithm #2 - Gradient Boosting
if (notifyStatus) email_notify(paste("Algorithm #2 tuning has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(nrounds=c(100,200,300,400,500), max_depth=3, eta=0.4, gamma=0, colsample_bytree=0.6, min_child_weight=1, subsample=0.5)
fit.final2 <- train(targetVar~., data=Xy_train, method="xgbTree", metric=metricTarget, tuneGrid=grid, trControl=control, verbose=F)
plot(fit.final2)
print(fit.final2)
## eXtreme Gradient Boosting
##
## 157 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 141, 142, 141, 142, 141, 142, ...
## Resampling results across tuning parameters:
##
## nrounds Accuracy Kappa
## 100 0.7965196 0.5896322
## 200 0.7706863 0.5401764
## 300 0.7706863 0.5401764
## 400 0.7714706 0.5408377
## 500 0.7711029 0.5384675
##
## Tuning parameter 'max_depth' was held constant at a value of 3
## Tuning parameter 'eta' was held constant at a value of 0.4
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning parameter 'colsample_bytree' was held constant at a value of 0.6
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Tuning parameter 'subsample' was held constant at a value of 0.5
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 100, max_depth = 3,
## eta = 0.4, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1
## and subsample = 0.5.
proc.time()-startTimeModule
## user system elapsed
## 3.731 0.199 2.380
if (notifyStatus) email_notify(paste("Algorithm #2 tuning completed!",date()))
results <- resamples(list(RF=fit.final1, GBM=fit.final2))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: RF, GBM
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.750 0.803125 0.8180147 0.8544363 0.9187500 1.0000000 0
## GBM 0.625 0.737500 0.8125000 0.7965196 0.8558824 0.9333333 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.4754098 0.6045532 0.6341783 0.7031428 0.8361486 1.0000000 0
## GBM 0.2380952 0.4850848 0.6250000 0.5896322 0.7050290 0.8672566 0
dotplot(results)
Once we have narrowed down to a model that we believe can make accurate predictions on unseen data, we are ready to finalize it. Finalizing the model involves sub-tasks such as validating it against the held-out test set and re-training it on the complete dataset.
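One such sub-task is persisting the fitted model for later scoring, which the commented saveRDS call near the end of this section performs. A minimal round-trip sketch, using a base-R lm on mtcars as a stand-in for the Random Forest:

```r
# Fit a stand-in model, save it to disk, reload it, and confirm the
# reloaded copy produces identical predictions.
model <- lm(mpg ~ wt + hp, data = mtcars)
path <- tempfile(fileext = ".rds")
saveRDS(model, path)
restored <- readRDS(path)
all.equal(predict(model, mtcars), predict(restored, mtcars))  # TRUE
```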
if (notifyStatus) email_notify(paste("Model Validation and Final Model Creation has begun!",date()))
predictions <- predict(fit.final1, newdata=Xy_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction M R
## M 23 6
## R 4 18
##
## Accuracy : 0.8039
## 95% CI : (0.6688, 0.9018)
## No Information Rate : 0.5294
## P-Value [Acc > NIR] : 4.341e-05
##
## Kappa : 0.6047
##
## Mcnemar's Test P-Value : 0.7518
##
## Sensitivity : 0.8519
## Specificity : 0.7500
## Pos Pred Value : 0.7931
## Neg Pred Value : 0.8182
## Prevalence : 0.5294
## Detection Rate : 0.4510
## Detection Prevalence : 0.5686
## Balanced Accuracy : 0.8009
##
## 'Positive' Class : M
##
pred <- prediction(as.numeric(predictions), as.numeric(y_test))
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, colorize=TRUE)
auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]
cat('Area under the curve is:', auc)
## Area under the curve is: 0.8009259
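Because the ROC curve here is built from hard class labels rather than predicted probabilities, it has a single operating point and the AUC collapses to balanced accuracy, (sensitivity + specificity) / 2 — which is why the value above matches the Balanced Accuracy in the confusion matrix. A quick check of that identity on made-up label vectors (1 and 2 are arbitrary class codes, with 2 treated as positive):

```r
# Hypothetical truth/prediction vectors coded as 1 and 2
truth <- c(1, 1, 1, 1, 2, 2, 2, 2)
preds <- c(1, 1, 1, 2, 2, 2, 1, 2)
sens <- mean(preds[truth == 2] == 2)   # sensitivity for the positive class
spec <- mean(preds[truth == 1] == 1)   # specificity
balancedAcc <- (sens + spec) / 2
# AUC via the Mann-Whitney formulation: P(score+ > score-) + 0.5 * P(tie)
cmp <- outer(preds[truth == 2], preds[truth == 1], "-")
auc <- mean((cmp > 0) + 0.5 * (cmp == 0))
c(balancedAcc = balancedAcc, auc = auc)  # both 0.75
```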
predictions <- predict(fit.final2, newdata=Xy_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction M R
## M 19 4
## R 8 20
##
## Accuracy : 0.7647
## 95% CI : (0.6251, 0.8721)
## No Information Rate : 0.5294
## P-Value [Acc > NIR] : 0.0004667
##
## Kappa : 0.5321
##
## Mcnemar's Test P-Value : 0.3864762
##
## Sensitivity : 0.7037
## Specificity : 0.8333
## Pos Pred Value : 0.8261
## Neg Pred Value : 0.7143
## Prevalence : 0.5294
## Detection Rate : 0.3725
## Detection Prevalence : 0.4510
## Balanced Accuracy : 0.7685
##
## 'Positive' Class : M
##
pred <- prediction(as.numeric(predictions), as.numeric(y_test))
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, colorize=TRUE)
auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]
cat('Area under the curve is:', auc)
## Area under the curve is: 0.7685185
startTimeModule <- proc.time()
set.seed(seedNum)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
# Combining datasets to form a complete dataset that will be used to train the final model
Xy_complete <- rbind(Xy_train, Xy_test)
finalModel <- randomForest(targetVar~., Xy_complete, mtry=2, na.action=na.omit)
summary(finalModel)
## Length Class Mode
## call 5 -none- call
## type 1 -none- character
## predicted 208 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 416 matrix numeric
## oob.times 208 -none- numeric
## classes 2 -none- character
## importance 60 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 208 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
proc.time()-startTimeModule
## user system elapsed
## 0.291 0.001 0.293
#saveRDS(finalModel, "./finalModel_BinaryClass.rds")
if (notifyStatus) email_notify(paste("Model Validation and Final Model Creation Completed!",date()))
proc.time()-startTimeScript
## user system elapsed
## 90.704 4.291 73.156